TL/UCP: Update topo aware ring algorithm by Juee14Desai · Pull Request #1288 · openucx/ucc

Juee14Desai · 2026-03-24T22:53:39Z

What

Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.

Why ?

The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi GPU systems with NVLink, this results in suboptimal data routing: transfers may traverse slower paths instead of direct NVLink links.

How ?

Allgather:

Rewritten to derive ring rank, peer, and block indices from the cuda_ring topology pattern instead of flat team rank.
Each of the up to 8 rings transfers its own sub block slice concurrently.
Algorithm auto selected for CUDA memory >4KB when cuda_ring is available via dynamic score string in allgather.c.
Service allgather decoupled into dedicated service_allgather_ring_start/progress functions in tl_ucp_service_coll.c so internal collectives continue using the flat rank ring.
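As a rough illustration of how each ring can own its own slice of a block, the sketch below splits `block_cnt` elements across `nrings` rings. The helper names `slice_count`, `slice_offset`, and `slice_displ` are hypothetical and written only for this example; they are not the UCC helpers used in the PR, and the policy for distributing the remainder is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements of one block handled by ring_id.
 * Assumed policy: the first (block_cnt % nrings) rings get one extra element. */
size_t slice_count(size_t block_cnt, unsigned nrings, unsigned ring_id)
{
    assert(nrings > 0 && ring_id < nrings);
    size_t base = block_cnt / nrings;
    size_t rem  = block_cnt % nrings;
    return base + (ring_id < rem ? 1 : 0);
}

/* Hypothetical helper: element offset of ring_id's slice inside one block. */
size_t slice_offset(size_t block_cnt, unsigned nrings, unsigned ring_id)
{
    assert(nrings > 0 && ring_id < nrings);
    size_t base = block_cnt / nrings;
    size_t rem  = block_cnt % nrings;
    return (size_t)ring_id * base + (ring_id < rem ? ring_id : rem);
}

/* Byte displacement of a ring's slice in a buffer of equally sized blocks,
 * following the form (block * block_cnt + ring_offset) * dt_size. */
size_t slice_displ(size_t block, size_t block_cnt, unsigned nrings,
                   unsigned ring_id, size_t dt_size)
{
    return (block * block_cnt + slice_offset(block_cnt, nrings, ring_id)) *
           dt_size;
}
```

With `block_cnt = 10` and `nrings = 4`, this policy yields slices of 3, 3, 2, and 2 elements at offsets 0, 3, 6, and 8, so the slices cover the block exactly and each ring can move its slice independently.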

Count	Size (bytes)	UCC master Time avg (us)	UCC master BW avg (GB/s)	UCC this PR Time avg (us)	UCC this PR BW avg (GB/s)
1048576	4194304	1318.36	47.72	1107.44	56.81
2097152	8388608	2527.03	49.79	1355.50	92.83
4194304	16777216	4946.76	50.87	1738.13	144.79
8388608	33554432	9760.98	51.56	2051.20	245.38
16777216	67108864	19394.07	51.90	3189.02	315.66
33554432	134217728	38630.25	52.12	5770.31	348.90
67108864	268435456	77089.72	52.23	10886.80	369.85

Reduce_scatter:

Rewritten to use cuda_ring for multi ring topology aware transfers.
Each ring handles its own sub block slice with per ring GPU reductions via the executor before forwarding to the next peer.
Scratch buffer management simplified to a single ucc_mc_alloc/free per task lifetime.

Count	Size (bytes)	UCC master Time avg (us)	UCC master BW avg (GB/s)	UCC this PR Time avg (us)	UCC this PR BW avg (GB/s)
16777216	67108864	4500.15	13.98	9472.58	6.64
33554432	134217728	4724.07	26.64	6875.58	18.30
67108864	268435456	6985.83	36.02	7527.26	33.43
134217728	536870912	11992.22	41.97	8311.25	60.56
268435456	1073741824	22032.55	45.69	10223.47	98.46
536870912	2147483648	42097.26	47.82	14125.28	142.53
1073741824	4294967296	82387.74	48.87	23308.91	172.75

Allreduce:

Monolithic implementation that fuses reduce_scatter and allgather into a single task/progress function, avoiding schedule overhead.
Phase 0 receives into scratch, reduces with the local dst block via GPU executor, then forwards the accumulated result. Phase 1 performs an in-place ring allgather.
Tagged send/recv counters are reset at the phase transition.
Auto selected for CUDA memory >4KB via dynamic score string in allreduce.c.
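The two phases can be sketched with an in-process simulation. This is only an illustration under simplifying assumptions: a single flat ring on host memory, a sum reduction, and fixed `NRANKS` and `BLOCK` sizes chosen for the example. It does not model the cuda_ring topology, the GPU executor, or the tagged send/recv counters, and `ring_allreduce_sim` is a hypothetical name.

```c
#include <assert.h>

#define NRANKS 4                 /* simulated team size (assumption)  */
#define BLOCK  2                 /* elements per block (assumption)   */
#define COUNT  (NRANKS * BLOCK)  /* elements per rank buffer          */

/* Simulate a flat, single-ring allreduce (sum) across NRANKS "ranks".
 * buf[r] is rank r's in-place buffer holding NRANKS blocks. */
void ring_allreduce_sim(float buf[NRANKS][COUNT])
{
    float scratch[NRANKS][BLOCK];
    int   step, r, i;

    assert(NRANKS >= 2);

    /* Phase 0: reduce-scatter. Each rank receives a block from its left
     * neighbor into scratch, reduces it with its local block, and forwards
     * the accumulated block on the next step. */
    for (step = 0; step < NRANKS - 1; step++) {
        for (r = 0; r < NRANKS; r++) {
            int from  = (r - 1 + NRANKS) % NRANKS;
            int block = (r - step - 1 + 2 * NRANKS) % NRANKS;
            for (i = 0; i < BLOCK; i++) {
                scratch[r][i] = buf[from][block * BLOCK + i];
            }
        }
        for (r = 0; r < NRANKS; r++) {
            int block = (r - step - 1 + 2 * NRANKS) % NRANKS;
            for (i = 0; i < BLOCK; i++) {
                buf[r][block * BLOCK + i] += scratch[r][i];
            }
        }
    }

    /* Phase 1: in-place ring allgather. After phase 0, rank r holds the
     * fully reduced block (r + 1) % NRANKS and circulates it. */
    for (step = 0; step < NRANKS - 1; step++) {
        for (r = 0; r < NRANKS; r++) {
            int from  = (r - 1 + NRANKS) % NRANKS;
            int block = (r - step + NRANKS) % NRANKS;
            for (i = 0; i < BLOCK; i++) {
                buf[r][block * BLOCK + i] = buf[from][block * BLOCK + i];
            }
        }
    }
}
```

In this sketch every rank ends with the element-wise sum of all ranks' inputs. Within one step a rank reads and writes different blocks, which is why the simulation can process the ranks sequentially.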

Count	Size (bytes)	UCC master Time avg (us)	UCC master BW avg (GB/s)	UCC this PR Time avg (us)	UCC this PR BW avg (GB/s)
1048576	4194304	N/A	N/A	999.99	7.86
2097152	8388608	N/A	N/A	1095.41	14.36
4194304	16777216	N/A	N/A	1290.22	24.38
8388608	33554432	N/A	N/A	4255.08	14.79
16777216	67108864	N/A	N/A	921.83	136.50
33554432	134217728	N/A	N/A	1796.86	140.05
67108864	268435456	N/A	N/A	3513.99	143.23

Juee14Desai · 2026-03-24T23:03:48Z

/build

wfaderhold21 · 2026-03-24T23:13:30Z

@Juee14Desai @janjust I assume this will replace PR #1258?

janjust · 2026-03-24T23:15:34Z

We talked about this, it shouldn't need to. If they are both good to go then let's have both ring and topo-aware ring and we can phase one out as needed.

Juee14Desai · 2026-04-01T20:51:46Z

/build

greptile-apps · 2026-04-02T16:37:00Z

Greptile Summary

This PR adds topology-aware multi-ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP, routing transfers along NVLink-optimal paths (up to 8 parallel rings via team->cuda_ring) and decouples the service allgather from the topo ring.

Allgather and reduce_scatter both gain dynamic score-string functions that auto-select the topo ring for CUDA memory >4KB when cuda_ring is present; the service allgather is correctly preserved as a separate flat-ring path.
Allreduce ring is implemented as a schedule chaining the new RS and AG sub-tasks, but no dynamic score-string function was added to tl_ucp_coll.c; the algorithm is registered but never automatically selected, so the benchmark gains shown in the description will not materialize without explicit user configuration.
Several correctness issues flagged in prior review rounds (source-buffer corruption for non-in-place RS, missing count-divisibility check in allgather, algorithm regression for odd-size CPU teams) remain unresolved in this revision.

Confidence Score: 3/5

The allreduce ring is never auto-selected and multiple correctness issues from prior reviews remain open.

The allreduce ring algorithm is fully implemented but dead for automatic use: tl_ucp_coll.c has no str_get_fn for allreduce, so the ring is silently skipped. The non-in-place reduce_scatter source-buffer corruption and the allgather count-truncation issue identified in previous rounds are still present.

src/components/tl/ucp/tl_ucp_coll.c (missing allreduce score-string integration), src/components/tl/ucp/reduce_scatter/reduce_scatter_ring.c (non-in-place source corruption), src/components/tl/ucp/allgather/allgather_ring.c (count divisibility)
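To make the count-truncation concern concrete: when the per-rank block size is computed with integer division, as in `(count / tsize) * dt_size`, any remainder of `count % tsize` elements is dropped. The sketch below shows the arithmetic and one possible guard. Both function names are hypothetical, and the `0` / `-1` return convention is an assumption standing in for a real status code.

```c
#include <assert.h>
#include <stddef.h>

/* Elements left uncovered when every block is sized as count / tsize. */
size_t truncated_elems(size_t count, size_t tsize)
{
    assert(tsize > 0);
    return count - (count / tsize) * tsize; /* same value as count % tsize */
}

/* Hypothetical guard: 0 when count splits evenly across the team, -1
 * otherwise. Real code would return a library status code instead. */
int check_count_divisible(size_t count, size_t tsize)
{
    if (tsize == 0 || count % tsize != 0) {
        return -1;
    }
    return 0;
}
```

For example, `count = 10` on a team of 4 gives blocks of 2 elements, which cover only 8 elements and leave 2 untransferred unless the case is rejected or handled separately.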

Important Files Changed

Filename	Overview
src/components/tl/ucp/tl_ucp_coll.c	Wires up the new reduce_scatter dynamic score string correctly, but the allreduce ring algorithm has no corresponding str_get_fn — the ring allreduce is never auto-selected for CUDA workloads.
src/components/tl/ucp/allgather/allgather_ring.c	Rewrites allgather ring to use the cuda_ring topology pattern with up to 8 parallel rings; start counter sentinel (1) and progress loop logic are internally consistent, but count-divisibility is not checked (flagged in previous review).
src/components/tl/ucp/reduce_scatter/reduce_scatter_ring.c	Complete rewrite to topology-aware multi-ring; non-in-place path still corrupts source buffer (flagged in previous review); ping-pong scratch for persistent is correct; missing send==recv symmetry assertion.
src/components/tl/ucp/allreduce/allreduce_ring.c	New schedule-based allreduce ring that chains RS+AG sub-tasks; functionally correct for in-place CUDA workloads, but never auto-selected because no dynamic score string was added to tl_ucp_coll.c; also lacks early cuda_ring guard before schedule allocation.
src/components/tl/ucp/tl_ucp_service_coll.c	Extracts a dedicated service_allgather_ring_start/progress so internal collectives continue using the flat ring; function pointers are correctly initialized before use.

Reviews (6): Last reviewed commit: "TL/UCP: topo aware ring algo for reduce_..."

greptile-apps · 2026-04-02T16:37:08Z

+
+            send_idx    = ucc_ring_pattern_get_send_block(ring, ring_id,
+                                                          rrank, step);
+            ring_offset = ucc_buffer_block_offset(block_cnt, nrings, ring_id);


Missing send_posted == recv_posted assertion

The allgather ring progress function asserts task->tagged.send_posted == task->tagged.recv_posted at the top of its loop (since sends and recvs are always posted in pairs). The allreduce ring's RS-phase loop has the send_posted > 0 / recv_posted > 0 guards but omits the equality check, making it harder to catch a bookkeeping bug early. Consider adding the same assertion for consistency and diagnosability.

greptile-apps · 2026-04-16T22:50:48Z

+                reduce_target = PTR_OFFSET(
+                    sbuf,
+                    (recv_block * block_cnt + ring_offset) * dt_size);


Non-in-place operation silently corrupts the source buffer

For non-in-place reduce_scatter, sbuf = args->src.info.buffer. On every non-final step the reduction writes back into that same buffer:

reduce_target = PTR_OFFSET(sbuf, (recv_block * block_cnt + ring_offset) * dt_size);

The subsequent send loop then reads from the now-mutated sbuf location to forward the intermediate partial sum. The old implementation accumulated partial results in the dedicated s_scratch send buffer and left sbuf read-only. Callers with separate source and destination buffers (i.e., any non-in-place reduce_scatter over CUDA memory when cuda_ring is present) will have their source data silently overwritten.

The simplest safe fix is to reject non-in-place in reduce_scatter_ring_init_common (analogous to the UCC_IS_PERSISTENT guard already there):

if (!UCC_IS_INPLACE(*args)) {
    return UCC_ERR_NOT_SUPPORTED;
}

Or alternatively, allocate a separate per-ring work buffer for intermediate reductions instead of reusing sbuf.
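The second option can be sketched as follows, on host memory and for a sum reduction only. This is an assumption-laden illustration, not the PR's code: `reduce_step` is a hypothetical name, and in the actual implementation the reduction would run on the GPU through the executor. The point is that partial sums land in a separate work buffer, so the caller's source buffer stays read-only.

```c
#include <assert.h>
#include <stddef.h>

/* One reduce step of a ring reduce_scatter (sum), sketched on host memory.
 * src_block is const, so the caller's source data cannot be modified here;
 * the partial sum is written to a separate work buffer instead. */
void reduce_step(const float *src_block, const float *recv_block,
                 float *work_block, size_t n)
{
    size_t i;

    assert(src_block != work_block); /* work buffer must not alias the source */
    for (i = 0; i < n; i++) {
        work_block[i] = src_block[i] + recv_block[i];
    }
}
```

The next send would then read from `work_block` rather than from the source buffer, at the cost of extra scratch space per ring.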

Juee14Desai · 2026-04-16T22:51:08Z

/build

Juee14Desai · 2026-04-16T23:08:01Z

/build

janjust · 2026-04-30T15:31:18Z

/build

Juee14Desai · 2026-04-30T21:53:12Z

/build

Juee14Desai · 2026-04-30T23:36:38Z

/build

Juee14Desai · 2026-05-01T05:49:01Z

/build

greptile-apps · 2026-05-01T05:57:02Z

        }
    }

+    if (algo_num == UCC_TL_UCP_ALLGATHER_ALG_RING && !team->cuda_ring) {


Algorithm regression for odd-size non-CUDA teams

When cuda_ring is NULL (any CPU team, or GPU team without NVLink topology), odd-size teams previously used the flat RING algorithm. After this guard, they fall through to KNOMIAL. This silently changes the default for every odd-size non-CUDA workload, including large CPU clusters.

The underlying reason is that allgather_ring_init_common now hard-fails with UCC_ERR_NOT_SUPPORTED when cuda_ring == NULL, so the score-string must avoid selecting it. However, the correct fallback for the "no-topology-info" case is still the flat ring (old behavior), not knomial. Consider keeping the flat-ring algorithm alive under a separate name/enum and using it as the fallback, or only switching to KNOMIAL when the caller is guaranteed to benefit.
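One way to read this suggestion is as a selection function that keeps a flat ring as the fallback when no topology information is available. The sketch below is hypothetical throughout: the enum, the function name, and the parameters are invented for illustration and are not the UCC score-string mechanism. `ALG_OTHER_DEFAULT` stands for whatever default applies in cases this thread does not describe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm ids for this sketch; not the UCC enum. */
typedef enum {
    ALG_OTHER_DEFAULT, /* placeholder for cases the thread does not specify */
    ALG_FLAT_RING,
    ALG_TOPO_RING
} allgather_alg_t;

#define TOPO_RING_MIN_BYTES 4096 /* the ">4KB" threshold from the description */

allgather_alg_t select_allgather_alg(int has_cuda_ring, int is_cuda_mem,
                                     size_t msg_bytes, unsigned team_size)
{
    assert(team_size > 0);

    /* Topology-aware ring only when topology info exists and applies. */
    if (has_cuda_ring && is_cuda_mem && msg_bytes > TOPO_RING_MIN_BYTES) {
        return ALG_TOPO_RING;
    }
    /* Assumed reading of the old behavior: odd-size teams keep using the
     * flat ring instead of falling through to knomial. */
    if (team_size % 2 == 1) {
        return ALG_FLAT_RING;
    }
    return ALG_OTHER_DEFAULT;
}
```

Under these assumptions, an odd-size CPU team still selects the flat ring, while a CUDA team with a cuda_ring and a message above the threshold selects the topology-aware ring.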

Replace the default ring allgather with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering.
- Each ring transfers its own slice of each block, enabling concurrent data movement across multiple NVLink paths.
- Algorithm auto selected for CUDA memory >4KB when cuda_ring is available; falls back to knomial otherwise.

Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c and decouples the service allgather from the topo aware ring.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>

Replace the default ring reduce_scatter with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering.
- Each ring handles its own sub block slice, with per ring GPU reductions via the executor before forwarding to the next peer.
- Scratch buffer management simplified to a single mc_alloc/free per task lifetime (removed fragmentation logic).

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>

wfaderhold21

If I understand correctly, this PR removes the CPU-based allgather/reduce_scatter ring algorithms. We will need a future PR to bring those back before the 1.9.0 release.

wfaderhold21 · 2026-06-03T20:58:27Z

+    ucc_rank_t          tsize  = ucc_ring_pattern_size(ring, 0);
+    ucc_rank_t          block = UCC_TL_TEAM_RANK(team);
+    size_t              data_size = (count / tsize) * ucc_dt_size(dt);
    ucc_status_t       status;


wfaderhold21 · 2026-06-03T21:10:36Z

@@ -1,5 +1,5 @@
 /**
- * Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2021-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


wfaderhold21 · 2026-06-03T21:11:50Z

+    if (!UCC_IS_INPLACE(*args)) {
        status = ucc_mc_memcpy(PTR_OFFSET(rbuf, data_size * block),
-                               sbuf, data_size, rmem, smem);
+                              sbuf, data_size, rmem, smem);


wfaderhold21 · 2026-06-03T21:12:50Z

@@ -1,424 +1,308 @@
 /**
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.


wfaderhold21 · 2026-06-03T21:23:22Z

+                                                              step - 1);
+                data_displ  = (send_block * block_cnt + ring_offset) * dt_size;
+                send_src     = PTR_OFFSET(sbuf, data_displ);
+                next_recv_dst = PTR_OFFSET(scratch, ring_offset * dt_size);


wfaderhold21 · 2026-06-03T21:24:02Z

+    void               *scratch  = task->reduce_scatter_ring.scratch;
+    ucc_rank_t          ring_id, rrank, adj_rrank, send_block, sendto, recvfrom;
+    size_t              ring_offset, ring_count, data_displ, data_size;
+    ucc_status_t        status;


wfaderhold21 · 2026-06-03T21:25:10Z

-    ucc_tl_ucp_team_t *tl_team;
-    ucc_status_t       status;
+    ucc_tl_ucp_team_t     *team = TASK_TEAM(task);
+    ucc_coll_args_t        *args = &TASK_ARGS(task);


wfaderhold21 · 2026-06-03T21:29:43Z

+    size_t             count = TASK_ARGS(task).dst.info.count;
+    size_t             data_size = (count / tsize) * ucc_dt_size(TASK_ARGS(task).dst.info.datatype);
+    ucc_rank_t         sendto, recvfrom, sblock, rblock;
+    int                step;


Juee14Desai requested a review from Sergei-Lebedev March 24, 2026 22:53

Juee14Desai force-pushed the ucc-ring-algo branch from 6c2242e to 843043c Compare March 24, 2026 23:02

Juee14Desai force-pushed the ucc-ring-algo branch 3 times, most recently from b80e124 to 9472c37 Compare March 31, 2026 05:48

Juee14Desai force-pushed the ucc-ring-algo branch 2 times, most recently from bd96bad to e827813 Compare April 1, 2026 20:51

janjust requested a review from MamziB April 2, 2026 15:43

Sergei-Lebedev added the ai-review start ai-review label Apr 2, 2026

greptile-apps Bot reviewed Apr 2, 2026

MamziB reviewed Apr 9, 2026

Comment thread src/components/tl/ucp/reduce_scatter/reduce_scatter.c

MamziB reviewed Apr 9, 2026

Comment thread src/components/tl/ucp/allgather/allgather.c

wfaderhold21 reviewed Apr 9, 2026

Juee14Desai force-pushed the ucc-ring-algo branch from e827813 to 282b683 Compare April 16, 2026 22:41

greptile-apps Bot reviewed Apr 16, 2026

Juee14Desai force-pushed the ucc-ring-algo branch from 282b683 to 453ac19 Compare April 30, 2026 21:33

Juee14Desai force-pushed the ucc-ring-algo branch from 453ac19 to e85aca1 Compare April 30, 2026 23:36

Juee14Desai force-pushed the ucc-ring-algo branch from e85aca1 to 9342df0 Compare May 1, 2026 05:48

greptile-apps Bot reviewed May 1, 2026

Juee14Desai added 2 commits June 3, 2026 12:30

janjust force-pushed the ucc-ring-algo branch from 9342df0 to ca3428e Compare June 3, 2026 17:30

janjust approved these changes Jun 3, 2026

janjust added the Target v1.9.0 label Jun 3, 2026

wfaderhold21 approved these changes Jun 3, 2026

5 participants
